Introduction

Variable  selection  is  a  critical  step  in  constructing  statistical  regression,  pattern 
classification,  or  time  series  models  that  are  capable  of  optimum  generalization  performance. 
Since  the  project  got  started  in  February  1996,  we  have  implemented  the  prototype  K-test  as 
proposed,  carried  out  extensive  testing  on  regression  and  time  series  problems,  and  developed 
a  selection  criterion  based  upon  unsupervised  clustering  methods.  The  latter  can  be  applied  to 
both  regression  and  classification  type  problems. 

Under  ONR  sponsorship,  a  number  of  criterion  functions  have  been  devised  and  tested 
for  developing  the  variable  selection  methodologies.  The  work  on  this  project  has  been 
conducted by Hong Pi and John Moody. Since Hong Pi has taken a job in industry, Howard
Yang (from Amari’s research group in Tokyo) will continue working on the project in place
of Hong.


Input Selection for Non-Parametric Regression, Classification, and Time Series Modeling - Status Report

Generally,  an  input  variable  selection  algorithm  consists  of  two  parts: 

A.  A  criterion  function  measuring  the  optimality  of  subsets  of  variables. 

B. A search algorithm that searches through the space of all possible subsets of variables and finds the “best” subset, based on the criterion function in A.

A  number  of  criterion  functions  have  been  devised  and  tested  for  developing  the  variable 
selection  methodologies. 

Estimate  of  Residual  Variance 

The  K-Test 

The  K-statistic  averages  over  the  variances  of  a  “local  neighborhood”  chosen  from  the  K 
nearest  neighbors. 

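$$
\begin{aligned}
\hat{\sigma}_K^2(\Delta_{KNN}) &= \langle (y - \bar{y})^2 \rangle_{KNN} \\
&= \tfrac{1}{2} \langle (y - y')^2 \rangle_{KNN} \\
&= \tfrac{1}{2} \langle (f(\mathbf{x}) - f(\mathbf{x}'))^2 \rangle_{KNN} + \tfrac{1}{2} \langle (r - r')^2 \rangle_{KNN} \\
&\approx \tfrac{1}{2} \langle (\nabla f(\mathbf{x}) \cdot (\mathbf{x}' - \mathbf{x}))^2 \rangle_{KNN} + \sigma_r^2 \\
&\approx \sigma_r^2 + \beta \cdot \Delta_{KNN}
\end{aligned} \tag{1}
$$

where

$$\Delta_{KNN} = \langle (\mathbf{x}' - \mathbf{x})^2 \rangle_{KNN} \tag{2}$$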
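As a rough illustration of the K-test statistic described above, the following Python sketch (not the report's MATLAB code) averages half the squared difference in y between each point and its K nearest neighbors, and also returns the mean squared input distance to those neighbors. The function name, the brute-force neighbor search, and the choice to pool all point-neighbor pairs are assumptions made for illustration.

```python
def knn_variance_statistic(xs, ys, k):
    """Return (delta, variance) averaged over every point's k nearest neighbors.

    delta    : mean squared input distance to the neighbors
    variance : half the mean squared difference in y to the neighbors
    """
    n = len(xs)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    sum_dx2 = 0.0
    sum_dy2 = 0.0
    for i in range(n):
        # Squared Euclidean distance from point i to every other point,
        # sorted so the k closest come first (brute force, O(n^2 log n)).
        neighbors = sorted(
            (sum((a - b) ** 2 for a, b in zip(xs[i], xs[j])), j)
            for j in range(n)
            if j != i
        )
        for dist2, j in neighbors[:k]:
            sum_dx2 += dist2
            sum_dy2 += (ys[i] - ys[j]) ** 2
    pairs = n * k
    return sum_dx2 / pairs, 0.5 * sum_dy2 / pairs
```

Calling this for several values of k yields a set of (delta, variance) pairs that can be examined as a function of neighborhood size.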

A  linear  extrapolation  is  found  to  be  useful  to  improve  the  variance  estimate. 
Linear extrapolation


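Fitting a line to $\hat{\sigma}_K^2(\Delta_{KNN})$ and extrapolating to $\Delta_{KNN} = 0$, the intercept gives a variance estimate

$$\hat{\sigma}_r^2 = \hat{\sigma}_K^2(\Delta_{KNN} = 0) \tag{3}$$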
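The extrapolation step can be sketched as an ordinary least-squares line fit whose intercept is read off at zero neighborhood size. This is a minimal Python sketch under that assumption; the report does not state which fitting procedure was used, and the function name is hypothetical.

```python
def extrapolate_to_zero(deltas, variances):
    """Fit variance = intercept + slope * delta by least squares.

    Returns (intercept, slope); the intercept is the variance
    estimate extrapolated to delta = 0.
    """
    n = len(deltas)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_d = sum(deltas) / n
    mean_v = sum(variances) / n
    sxx = sum((d - mean_d) ** 2 for d in deltas)
    if sxx == 0:
        raise ValueError("all delta values are identical")
    sxy = sum((d - mean_d) * (v - mean_v) for d, v in zip(deltas, variances))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_d
    return intercept, slope
```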


Some  numerical  results 

The various methods for variance estimation are tested on an artificial data set of the type

$$y = \sum_{i=1}^{n_{inp}} \sin(2\pi x_i + \phi_i) + r \tag{4}$$
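A data set of this type can be generated as in the Python sketch below, which reads equation (4) as a sum of sines of $2\pi x_i + \phi_i$ plus additive noise. The uniform inputs on [0, 1), the uniformly drawn phases, and the Gaussian noise are assumptions; the report does not specify these distributions.

```python
import math
import random


def make_sine_data(n_samples, n_inputs, noise_std, seed=0):
    """Generate (xs, ys) with y = sum_i sin(2*pi*x_i + phase_i) + noise."""
    rng = random.Random(seed)
    # One fixed phase per input variable (assumed uniform on [0, 2*pi)).
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_inputs)]
    xs, ys = [], []
    for _ in range(n_samples):
        x = [rng.random() for _ in range(n_inputs)]
        clean = sum(math.sin(2.0 * math.pi * xi + p) for xi, p in zip(x, phases))
        xs.append(x)
        ys.append(clean + rng.gauss(0.0, noise_std))
    return xs, ys
```

With `noise_std=0.2` the noise variance is 0.04.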

Figure 1 shows a two-input example. In this case all methods give good estimates of the
noise variance.


Figure 1: The variance statistics versus the Delta. The data set used features $n_{inp} = 2$, $N = 300$, $\sigma_r^2 = 0.04$ (marked by the horizontal line).

When the input dimension is increased to three, as shown in figure 2, the K-nearest-neighbor
based methods give misleading results. The delta-test does well at probing the
small-$\Delta$ region. However, if a linear extrapolation is used, it overshoots the target variance.

Figure 2: The variance statistics versus the Delta. The data set used features $n_{inp} = 3$, $N = 400$, $\sigma_r^2 = 0.055$ (marked by the horizontal line).

It is found that:


• For variance estimation, the region $\Delta \to 0$ is crucial.

•  The  linear  extrapolation  is  useful  but  can  be  very  unreliable.  Further  research  is 
needed  to  improve  the  reliability. 

• It is possible to alter the K-test somewhat by averaging over local neighborhoods
of like “sizes” rather than like K's. This approach might yield better results.


Variable  Sensitivity  Estimate 


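Adding noise to one of the variables and observing its effect on the variance estimate gives a measure of
that variable's sensitivity. Let

$$\tilde{x}_i = x_i + \epsilon \tag{5}$$

$$y = f(\mathbf{x}) + r = f(\tilde{\mathbf{x}}) - \frac{\partial f}{\partial x_i}\,\epsilon + r \tag{6}$$

where $\tilde{\mathbf{x}}$ denotes the set of variables $\{x_1, x_2, \ldots, x_{i-1}, \tilde{x}_i, x_{i+1}, \ldots, x_D\}$.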
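The variance estimate is

$$
\begin{aligned}
\hat{\sigma}_\delta^2(\Delta) &= \tfrac{1}{2} \left\langle (y - y')^2 \,\middle|\, |\tilde{\mathbf{x}} - \tilde{\mathbf{x}}'| < \delta \right\rangle \\
&= \tfrac{1}{2} \langle (f(\tilde{\mathbf{x}}) - f(\tilde{\mathbf{x}}'))^2 \rangle_\delta + \tfrac{1}{2} \left\langle \left( \frac{\partial f}{\partial x_i} \right)^2 \right\rangle \epsilon^2 + \tfrac{1}{2} \langle (r - r')^2 \rangle_\delta \\
&\approx \sigma_r^2 + \tfrac{1}{2} \left\langle \left( \frac{\partial f}{\partial x_i} \right)^2 \right\rangle \epsilon^2 + \beta \cdot \Delta
\end{aligned} \tag{7}
$$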

The  linear  extrapolation  results  in 

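$$\hat{\sigma}_\delta^2(\Delta = 0) = \sigma_r^2 + \tfrac{1}{2} \left\langle \left( \frac{\partial f}{\partial x_i} \right)^2 \right\rangle \epsilon^2 \tag{8}$$

Fitting two values $\epsilon_1$ and $\epsilon_2$ produces readings on both $\sigma_r^2$ and $S_i = \tfrac{1}{2} \langle (\partial f / \partial x_i)^2 \rangle$, where the
slope $S_i$ provides a sensitivity measure on the i:th input.

Taking $\epsilon \to -\epsilon$ helps reduce some of the biases.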
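If equation (8) is read as a straight line in $\epsilon^2$, with the noise variance as intercept and the sensitivity as slope, then two extrapolated variance readings at two noise amplitudes determine both quantities. The Python sketch below solves that two-point system; the function name and this reading of equation (8) are assumptions, not the report's code.

```python
def sensitivity_from_two_levels(eps1, var1, eps2, var2):
    """Solve var = noise_var + slope * eps**2 from two (eps, var) readings.

    Returns (noise_var, slope); the slope plays the role of the
    sensitivity measure for the perturbed input.
    """
    denom = eps2 ** 2 - eps1 ** 2
    if denom == 0:
        raise ValueError("the two noise amplitudes must differ in magnitude")
    slope = (var2 - var1) / denom
    noise_var = var1 - slope * eps1 ** 2
    return noise_var, slope
```

A larger slope indicates that perturbing the input inflates the variance estimate more strongly, which suggests the output depends more on that input.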


Selection  Criterion  Based  on  Clustering 

The  variance  estimates  are  applicable  only  to  regression  problems.  An  alternative  method 
is  developed  based  on  an  unsupervised  clustering  algorithm  (Ball  1965,  Therrien  1989). 
By  examining  the  characteristics  of  the  clusters  formed,  suitable  criterion  functions  can 
be  defined  for  both  classification  and  regression  problems.  Search  algorithms  can  then  be 
applied  to  find  the  “best”  subset  of  input  variables. 

Criterion  Function 

Given a data set {y, x} where y is the dependent variable and $\mathbf{x} = \{x_1, x_2, \ldots, x_D\}$ are
the independent variables, the criterion function should be defined so that it provides a
measure of how well y can be represented by a mapping from x. This requirement applies
both to regression problems, where y is a continuous variable, and to classification problems,
where y is a discrete set of class labels.
The construction of a criterion function is based on the following observation: if a
one-to-one mapping exists between x and y, then a set of points forming a cluster in the
x space should also resemble a cluster in the y space. Consequently, for regression,
the variance along the y axis will be small for the set of points on this cluster; and for
classification, the cluster should be overwhelmingly represented by samples from a particular
class.

Based  on  this  observation,  a  criterion  function  can  be  defined  in  the  following  manner: 

•  Assign  data  vectors  into  clusters  in  the  x  space.  This  can  be  achieved  by  a  clustering 
algorithm.  In  the  current  approach,  the  ISODATA  algorithm,  which  is  a  variation  of 
the  K-means  algorithm  with  heuristics  for  cluster  splitting  and  merging,  is  adopted. 

• For classification problems, define the criterion function as

$$C_c(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{K} n_{max}(i) \tag{9}$$

where N is the total number of points, K is the number of clusters, and $n_{max}(i)$ is
the number of points in the class that is most represented on the i:th cluster. The
ideal situation would be that the points on a cluster all have the same class label, in
which case $C_c(\mathbf{x}) = 1$.

• For regression problems, the criterion is

$$C_r(\mathbf{x}) = 1 - \frac{v(\mathbf{x})}{\sigma_y^2} \tag{10}$$

where $\sigma_y^2$ is the variance along the y direction of all the points, and $v(\mathbf{x})$ is a local
variance measure defined by

$$v(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{K} n(i)\,\sigma_y^2(i) \tag{11}$$

Here n(i) is the number of points on the i:th cluster, and $\sigma_y^2(i)$ is the y variance over
this set of points. $v(\mathbf{x})$ is the y variance of a cluster averaged over all clusters. $C_r$
is defined in analogy to the $R^2$ measure in linear regression theory. Clearly $C_r$
approaches unity when the input-output relationship is captured perfectly by the
clustering mechanism.
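Given cluster assignments produced by any clustering routine, the two criteria can be computed as in the following Python sketch. It is not the report's MATLAB code: the function names are hypothetical, the cluster assignments are taken as given (no ISODATA step is shown), and variances are computed as population variances (dividing by the number of points), which is an assumption.

```python
from collections import Counter


def _group_by_cluster(values, clusters):
    groups = {}
    for value, cluster in zip(values, clusters):
        groups.setdefault(cluster, []).append(value)
    return groups


def classification_criterion(labels, clusters):
    """Fraction of points belonging to the majority class of their cluster."""
    groups = _group_by_cluster(labels, clusters)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in groups.values()
    )
    return majority_total / len(labels)


def regression_criterion(values, clusters):
    """One minus (size-weighted mean within-cluster variance / total variance)."""
    n = len(values)
    mean_all = sum(values) / n
    total_var = sum((y - mean_all) ** 2 for y in values) / n
    if total_var == 0:
        raise ValueError("y has zero variance; the criterion is undefined")
    within = 0.0
    for members in _group_by_cluster(values, clusters).values():
        mean_c = sum(members) / len(members)
        # Sum of squared deviations equals n(i) times the cluster variance.
        within += sum((y - mean_c) ** 2 for y in members)
    return 1.0 - (within / n) / total_var
```

Both functions return 1 when every cluster is pure (a single class, or a single y value).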


Search  Methods 

An exhaustive search over all possible variable subsets would require $O(2^D)$ evaluations of
the criterion function. This is practical only when the number of variables D
is small. When D is larger than, say, 15, more efficient search algorithms must be used.
Forward selection and backward elimination algorithms (see e.g. Miller 1990) have been
implemented.

A forward selection algorithm starts from a null set and adds variables one by one as
long as each addition increases the value of the criterion function. This method is extremely simple
and efficient. Because it starts from a low-dimensional space, it generally produces quite
reliable selections if the number of good variables is small compared to the total number
of candidate variables. The ordering of the variables can become a problem, though: the
algorithm tends to pick up variables placed near the front and ignore the variables near
the end.
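One way to read the forward selection procedure is sketched below in Python (not the report's MATLAB code). The sketch scans the candidates in order and accepts the first variable that raises the criterion, which reproduces the bias toward variables near the front; whether the original implementation scans this way is an assumption. `criterion` is a hypothetical callable that maps a 0/1 inclusion list to a score and must be defined for the empty set.

```python
def forward_selection(n_vars, criterion):
    """Greedy forward selection over a 0/1 inclusion mask.

    Returns (mask, value) for the final subset.
    """
    mask = [0] * n_vars
    best = criterion(mask)
    improved = True
    while improved:
        improved = False
        for i in range(n_vars):
            if mask[i]:
                continue
            trial = mask[:]
            trial[i] = 1
            value = criterion(trial)
            if value > best:
                # Accept the first improving variable, then rescan.
                mask, best = trial, value
                improved = True
                break
    return mask, best
```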

The  backward  elimination  algorithm  starts  from  a  full  variable  set.  A  variable  is  then 
eliminated  if  its  removal  does  not  cause  a  significant  deterioration  in  the  value  of  the 
criterion  function.  Specifically,  the  algorithm  consists  of  the  following  steps: 


Step 1: With all D candidate variables included, calculate the corresponding criterion
function value $CR_0$.

Step 2: For i = D, D - 1, ..., 1, construct a variable subset consisting of the original set
of variables excluding the i:th one, and compute its corresponding criterion value $CR_i$; if
$CR_0 - CR_i < 0$, define $i_r = i$ and break the loop. The $i_r$:th variable is a target for
removal.

Step 3: If $i_r$ is defined, remove the $i_r$:th variable permanently, and let D = D - 1. Go to
step 1. Otherwise the algorithm halts.
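Steps 1 to 3 can be sketched in Python as follows. This is an illustrative reading, not the report's MATLAB code: `criterion` is a hypothetical callable taking a 0/1 inclusion list, and the sketch keeps a fixed-length mask in place of renumbering the variables after each removal.

```python
def backward_elimination(n_vars, criterion):
    """Backward elimination over a 0/1 inclusion mask.

    Returns (mask, value) once no single removal raises the criterion.
    """
    mask = [1] * n_vars
    while True:
        base = criterion(mask)  # Step 1: value with all current variables
        target = None
        # Step 2: scan remaining variables from the last to the first.
        for i in reversed(range(n_vars)):
            if not mask[i]:
                continue
            trial = mask[:]
            trial[i] = 0
            if base - criterion(trial) < 0:
                target = i
                break
        # Step 3: remove the target and repeat, or halt.
        if target is None:
            return mask, base
        mask[target] = 0
```

Because the scan runs from the last variable to the first, ties between removable variables are resolved in favor of dropping the later one.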


MATLAB  Implementations 


The  algorithms  mentioned  in  the  previous  sections  are  implemented  as  MATLAB  functions. 
The  following  conventions  have  been  used  in  the  implementation. 

The actual criterion function used is the expression defined in eqs. (9) and (10) with an added
term penalizing a larger number of input variables. Specifically, for classification problems,
the criterion used is

$$C'_c(\mathbf{x}) = C_c(\mathbf{x}) - p_c \cdot D \cdot C_c(\mathbf{x}) \tag{12}$$

where $p_c$ is a parameter specifying the percentage penalty for having an extra input variable.
By default $p_c = 0.005$. For regression problems the criterion used is

$$C'_r(\mathbf{x}) = C_r(\mathbf{x}) - p_r \cdot D \cdot C_r(\mathbf{x}) \tag{13}$$

By default $p_r = 0.01$.
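The penalty amounts to shrinking the raw criterion by a fixed percentage per included variable. A minimal Python sketch, with a hypothetical function name and the default penalties quoted above collected in a dictionary for convenience:

```python
# Default per-variable penalties quoted in the text.
DEFAULT_PENALTY = {"classification": 0.005, "regression": 0.01}


def penalized_criterion(raw_value, n_vars, penalty):
    """Return raw_value - penalty * n_vars * raw_value."""
    return raw_value - penalty * n_vars * raw_value
```

For example, a raw classification value of 0.9 with four variables and the default penalty becomes 0.9 x (1 - 4 x 0.005) = 0.882.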


A data set is represented by a pair of MATLAB variables Y, X, where X is an N x D
matrix. Each row of X corresponds to a data sample. D is the number of variables and N
is the number of samples. Y is an N x 1 vector giving the values of the dependent variable.
For classification problems, the values of Y must be integers in {1, 2, 3, ...} corresponding
to class categories.

A  variable  subset  is  represented  by  a  bit  vector,  e.g. 

$f = \{1, 1, 0, 1, \ldots, 1\}$

where the i:th bit is set to 1/0 if the i:th variable is included/excluded. This is similar to
the binary chromosome representation used in genetic algorithms (GA). Variable subset
search can also be formulated as a genetic optimization problem, although this has not
been done in this report. Borrowing from the GA jargon, the term “criterion function” is
used interchangeably with “fitness function”.
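To illustrate the bit-vector representation, the Python sketch below (not the report's MATLAB code; the function name is hypothetical) extracts the columns of a data matrix whose bits are set to 1.

```python
def select_columns(rows, mask):
    """Keep only the columns of `rows` whose mask bit is 1."""
    if rows and len(rows[0]) != len(mask):
        raise ValueError("mask length must equal the number of columns")
    keep = [j for j, bit in enumerate(mask) if bit]
    return [[row[j] for j in keep] for row in rows]
```

For instance, applying the mask [1, 0, 1] to the rows [1, 2, 3] and [4, 5, 6] leaves [1, 3] and [4, 6].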

The typical sequence of instructions for using this implementation is illustrated
below.

Some of the MATLAB functions can be sped up considerably if they are compiled into
CMEX code. This can be done by


>  kmcompil 

This is a one-time step that needs to be repeated only when the functions are ported to
a new machine. The MATLAB compiler is required.

Assuming the data set is stored in space-delimited format in a text file “yx.data”, the first
step is to load the data:

>  load  yx.data 

The data matrix is then partitioned into its Y and X parts. The following instructions assume
that the Y variable is stored in the first column of yx and the remaining columns give the X variables:

>  [N,D]  =  size(yx);     %  D  here  counts  all  columns  of  yx,  including  Y
>  Y  =  yx(:,1);          %  dependent  variable
>  X  =  yx(:,2:D);        %  independent  variables
The variables in X may need to be scaled appropriately so they have similar magnitudes
and variations. For classification problems, the class categories in Y need to be converted
to integers between 1 and K if they do not already conform to this convention.
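Both preprocessing steps can be sketched in Python as follows. Scaling each column to zero mean and unit variance is one common choice and is an assumption here, since the report only asks for similar magnitudes and variations; the function names are hypothetical, and class labels are numbered in sorted order.

```python
import math


def standardize_columns(rows):
    """Scale every column to zero mean and unit (population) variance."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_cols)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n)
        for j in range(n_cols)
    ]
    # A constant column has zero spread; map it to all zeros.
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(n_cols)]
        for row in rows
    ]


def encode_labels(labels):
    """Map arbitrary class labels to integers 1..K; returns (codes, mapping)."""
    mapping = {label: idx for idx, label in enumerate(sorted(set(labels)), start=1)}
    return [mapping[label] for label in labels], mapping
```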

Variable  search  is  invoked  by  one  of  the  following  function  calls.  For  classification  problems, 


>  kmfbclas(Y,X,'forward');     %  Forward  selection  algorithm
Or,

>  kmfbclas(Y,X,'backward');    %  Backward  elimination  algorithm
and for regression problems,

>  kmfbreg(Y,X,'forward');      %  Forward  selection  algorithm
Or,

>  kmfbreg(Y,X,'backward');     %  Backward  elimination  algorithm


Sample  scripts  demonstrating  these  steps  are  given  as  democlas.m  and  demoreg.m. 

MATLAB is a development tool and as such is not the best platform for deploying this
type of algorithm. In the current implementation speed is still a bottleneck.
For field applications it is desirable to implement the algorithm in a dedicated program
package and utilize the newest generation of powerful processors.